19. Summary

Figure: The agent-environment interaction in reinforcement learning. (Source: Sutton and Barto, 2017)
### The Setting, Revisited
- The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.
- At each time step, the agent receives the environment's state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded appropriately to the state) and a new state.
- The goal of every agent is to maximize expected cumulative reward, or the expected sum of rewards attained over all time steps.
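
A minimal sketch of this interaction loop may help make the bullet points concrete. The Gym-style `env` object (with `reset`/`step` methods) and the `choose_action` function below are illustrative placeholders, not part of the original summary:

```python
# Hypothetical sketch of the agent-environment loop described above.
# `env` and `choose_action` are assumed placeholders in the style of
# common RL interfaces.
def run_episode(env, choose_action):
    cumulative_reward = 0.0
    state = env.reset()                          # environment presents the first state
    done = False
    while not done:
        action = choose_action(state)            # agent responds with an action
        state, reward, done = env.step(action)   # environment returns reward and next state
        cumulative_reward += reward              # sum of rewards over all time steps
    return cumulative_reward
```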
### Episodic vs. Continuing Tasks
- A task is an instance of the reinforcement learning (RL) problem.
- Continuing tasks are tasks that continue forever, without end.
- Episodic tasks are tasks with a well-defined starting and ending point.
- In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
- Episodic tasks come to an end whenever the agent reaches a terminal state.
### The Reward Hypothesis
- Reward Hypothesis: All goals can be framed as the maximization of (expected) cumulative reward.
### Goals and Rewards
- (Please see Part 1 and Part 2 to review an example of how to specify the reward signal in a real-world problem.)
### Cumulative Reward
- The return at time step t is G_t := R_{t+1} + R_{t+2} + R_{t+3} + \ldots
- The agent selects actions with the goal of maximizing expected (discounted) return. (Note: discounting is covered in the next concept.)
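
As a quick illustration, the return is just the sum of the rewards that follow time step t. The reward values below are made up for the example:

```python
# Sketch: the (undiscounted) return G_t is the sum of the rewards
# R_{t+1}, R_{t+2}, ... collected after time step t.
def compute_return(rewards):
    return sum(rewards)

# Illustrative rewards received after time step t.
print(compute_return([1.0, 0.0, 2.0]))  # 3.0
```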
### Discounted Return
- The discounted return at time step t is G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots
- The discount rate \gamma is something that you set, to refine the goal that you have for the agent.
- It must satisfy 0 \leq \gamma \leq 1.
- If \gamma=0 , the agent only cares about the most immediate reward.
- If \gamma=1 , the return is not discounted.
- For larger values of \gamma, the agent cares more about the distant future. Smaller values of \gamma result in more extreme discounting, where, in the most extreme case, the agent only cares about the most immediate reward.
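
The short sketch below, with a made-up reward sequence, shows how the choice of \gamma changes the discounted return:

```python
# Sketch: discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# The reward sequence and discount rates are illustrative.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.0))  # 1.0   -> only the most immediate reward counts
print(discounted_return(rewards, gamma=0.9))  # ~4.1  -> later rewards count, but less
print(discounted_return(rewards, gamma=1.0))  # 5.0   -> undiscounted return
```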
### MDPs and One-Step Dynamics
- The state space \mathcal{S} is the set of all (nonterminal) states.
- In episodic tasks, we use \mathcal{S}^+ to refer to the set of all states, including terminal states.
- The action space \mathcal{A} is the set of possible actions. (Alternatively, \mathcal{A}(s) refers to the set of possible actions available in state s \in \mathcal{S}.)
- (Please see Part 2 to review how to specify the reward signal in the recycling robot example.)
- The one-step dynamics of the environment determine how the environment decides the state and reward at every time step. The dynamics can be defined by specifying p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_{t} = s, A_{t}=a) for each possible s', r, s, \text{and } a. (A small lookup-table sketch of these dynamics appears after the MDP definition below.)
- A (finite) Markov Decision Process (MDP) is defined by:
  - a (finite) set of states \mathcal{S} (or \mathcal{S}^+, in the case of an episodic task)
  - a (finite) set of actions \mathcal{A}
  - a set of rewards \mathcal{R}
  - the one-step dynamics of the environment
  - the discount rate \gamma \in [0,1]
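
To make the one-step dynamics and the MDP components concrete, here is a sketch of a tiny, made-up finite MDP stored as a lookup table; every state, action, reward, and probability below is invented for illustration:

```python
# Hypothetical two-state MDP. For each state-action pair, the table lists
# (probability, next state, reward) triples, i.e. the one-step dynamics.
dynamics = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(1.0, "s1", 2.0)],
}

state_space = {"s0", "s1"}                       # the set S of (nonterminal) states
action_space = {"left", "right"}                 # the set A of actions
reward_space = {r for v in dynamics.values() for _, _, r in v}  # the set R of rewards
gamma = 0.9                                      # discount rate in [0, 1]

def p(s_next, r, s, a):
    """p(s', r | s, a): probability of next state s' and reward r, given s and a."""
    return sum(prob for prob, sp, rew in dynamics.get((s, a), [])
               if sp == s_next and rew == r)

print(p("s1", 1.0, "s0", "right"))  # 0.8
```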